Neural Optical Understanding for Academic Documents

🚀 Meta AI has released an OCR gem: Nougat! 🎉

It easily converts academic PDF documents to MultiMarkdown, and is especially strong on complex mathematical formulas. 📚➡️🔍

Even scanned PDFs can be converted! It is built on a trained Transformer model. 🤖📘

One command to install, one command to run, ready to use out of the box!

Features

Academic paper to Markdown (results verified by me):

![[Pasted image 20230829235550.png]]

Scanned PDF to Markdown:

![[Pasted image 20230829235635.png]]

Distorted scanned PDF to Markdown (example from the official site):

![[Pasted image 20230829235739.png]]

Usage Notes

MultiMarkdown

The output format is MultiMarkdown, which suits academic writing well. But I normally write Markdown in Obsidian, so formatting details such as formulas need manual adjustment before they render correctly in Markdown.

Formulas

It converts formulas into correct LaTeX. Incredibly powerful!!! Academics everywhere must be weeping with joy!! Such happiness.

Tables

It recognizes tables, but outputs them in LaTeX format, which cannot be rendered in Markdown.

Images

The generated document does not contain images. You can extract the images from the PDF yourself; taking screenshots by hand is also a simple option.

The Paper

Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic, from Meta AI

Abstract

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.


1 Introduction

The majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [1]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost.

Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR [2], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial.

Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers cannot be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [3], capture the text of 12M² papers using GROBID [4], but are missing meaningful representations of the mathematical equations.

Footnote 2: The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc

To this end, we introduce Nougat, a Transformer based model that can convert images of document pages to formatted markup text.

The primary contributions in this paper are

![[Pasted image 20230830002132.png]]
Figure 1: Our simple end-to-end architecture following Donut [28]. The Swin Transformer encoder takes a document image and converts it into latent embeddings, which are subsequently converted to a sequence of tokens in an autoregressive manner.

2 Related Work

Optical Character Recognition (OCR) is an extensively researched field in computer vision with a variety of applications, such as document digitalization [2, 5], handwriting recognition and scene text recognition [6, 7, 8].

More concretely, recognizing mathematical expressions is a heavily researched subtopic. Grammar based methods [9, 10, 11] for handwritten mathematical expressions were improved upon by different encoder-decoder models. The fully convolutional model [12] was succeeded by various RNN decoder models [13, 14, 15, 16, 17], both for handwritten and printed formulas. Recently, the decoder [18, 19] as well as the encoder [20] were replaced with the Transformer [21] architecture.

Visual Document Understanding (VDU) is another related topic of deep learning research and focuses on extracting relevant information from a variety of document types. Previous works depend on pre-trained models that learn to extract information by jointly modeling text and layout information using the Transformer architecture. The LayoutLM model family [22, 23, 24] uses a masked layout prediction task to capture the spatial relationships between different document elements.

Open source solutions with goals related to ours include GROBID [4], which parses digital-born scientific documents to XML with a focus on the bibliographic data, and pdf2htmlEX [25], which converts digital-born PDFs to HTML while preserving the layout and appearance of the document. However, neither solution can recover the semantic information of mathematical expressions.

3 Model

Previous VDU methods either rely on OCR text from a third party tool [22, 23, 26] or focus on document types such as receipts, invoices or form-like documents [27]. Recent studies [28, 29] show that an external OCR engine is not necessarily needed to achieve competitive results in VDU.

The architecture is an encoder-decoder transformer architecture [21] that allows for an end-to-end training procedure. We build on the Donut [28] architecture. The model does not require any OCR related inputs or modules. The text is recognized implicitly by the network. See Fig. 1 for an overview of the approach.

Encoder The visual encoder receives a document image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$, crops the margins and resizes the image to fit in a fixed rectangle of size $(H, W)$. If the image is smaller than the rectangle, additional padding is added to ensure each image has the same dimensionality. We use a Swin Transformer [30], a hierarchical vision transformer [31] that splits the image into non-overlapping windows of fixed size and applies a series of self-attention layers to aggregate information across these windows. The model outputs a sequence of embedded patches $z \in \mathbb{R}^{d \times N}$, where $d$ is the latent dimension and $N$ is the number of patches.

Decoder The encoded image $z$ is decoded into a sequence of tokens using a transformer decoder architecture with cross-attention. The tokens are generated in an auto-regressive manner, using self-attention and cross-attention to attend to different parts of the input sequence and encoder output respectively. Finally, the output is projected to the size of the vocabulary $v$, yielding the logits $\boldsymbol{\ell} \in \mathbb{R}^{v}$.

Following Kim et al. [28], we use the implementation of the mBART [32] decoder. We use the same tokenizer as Taylor et al. [33] because their model is also specialized in the scientific text domain.

3.1 Setup

We render the document images at a resolution of 96 DPI. Due to the restrictive possible input dimensions of the Swin Transformer we choose the input size $(H, W) = (896, 672)$. The aspect ratio is in between the US letter and DIN A4 formats: $\frac{22}{17} < \frac{4}{3} < \sqrt{2}$. The document images are resized and then padded to achieve the desired input size. This input size allows us to use the Swin base model architecture [30]. We initialize the model with the pre-trained weights.
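As a concrete illustration of the resize-and-pad step, here is a minimal sketch using Pillow. The function name and the top-left padding placement are assumptions made for illustration; the released code may crop margins and position the page differently.

```python
from PIL import Image

H, W = 896, 672  # fixed input size chosen for the Swin base architecture

def resize_and_pad(img: Image.Image, h: int = H, w: int = W) -> Image.Image:
    """Shrink the page to fit inside (h, w), then pad with white to exactly (h, w)."""
    img = img.convert("RGB")
    img.thumbnail((w, h), Image.LANCZOS)  # resizes in place, keeps aspect ratio
    canvas = Image.new("RGB", (w, h), (255, 255, 255))
    canvas.paste(img, (0, 0))  # remaining area stays white padding
    return canvas

page = resize_and_pad(Image.open("page.png"))  # "page.png" is a stand-in file name
assert page.size == (W, H)
```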

The Transformer decoder has a maximal sequence length of $S = 4096$. This relatively large sizing is due to the fact that the text of academic research papers can be dense and the syntax for tables in particular is token intensive. The BART decoder is a decoder-only transformer with 10 layers. The entire architecture has a total of 350M parameters.

We also experiment with a smaller model (250M parameters) with a slightly smaller sequence length of $S = 3584$ and only 4 decoder layers, where we start from the pre-trained base model.

During inference the text is generated using greedy decoding.

Training We use an AdamW optimizer [34] to train for 3 epochs with an effective batch size of 192. Due to training instabilities, we choose a learning rate of $\mathrm{lr}_{\mathrm{init}} = 5 \times 10^{-5}$, which is reduced by a factor of 0.9996 every 15 updates until it reaches $\mathrm{lr}_{\mathrm{end}} = 7.5 \times 10^{-6}$.
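A minimal PyTorch sketch of this schedule, assuming the decay is applied per 15-update bucket and clamped at $\mathrm{lr}_{\mathrm{end}}$; the Linear module is a stand-in for the actual network.

```python
import torch

lr_init, lr_end = 5e-5, 7.5e-6
model = torch.nn.Linear(8, 8)  # stand-in for the Nougat network
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_init)

def lr_lambda(step: int) -> float:
    # multiply the learning rate by 0.9996 every 15 updates, clamped at lr_end
    return max(0.9996 ** (step // 15), lr_end / lr_init)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(1000):
    optimizer.step()   # the gradient update would happen here
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # decayed learning rate
```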

3.2 Data Augmentation

In image recognition tasks, it is often beneficial to use data augmentation to improve generalization. Since we are only using digital-born academic research papers, we need to employ a number of transformations to simulate the imperfections and variability of scanned documents. These transformations include erosion, dilation, gaussian noise, gaussian blur, bitmap conversion, image compression, grid distortion and elastic transform [35]. Each has a fixed probability of being applied to a given image. The transformations are implemented in the Albumentations [36] library. For an overview of the effect of each transformation, see Fig. 2.
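A rough sketch of such a pipeline with Albumentations is shown below. The transforms and probabilities are illustrative, not the authors' exact configuration; erosion, dilation and bitmap conversion would need custom OpenCV-based transforms and are omitted here.

```python
import numpy as np
import albumentations as A

# Each transform fires independently with a fixed probability,
# simulating scanning artifacts on clean digital-born pages.
augment = A.Compose([
    A.GaussNoise(p=0.2),
    A.GaussianBlur(p=0.2),
    A.ImageCompression(p=0.2),
    A.GridDistortion(p=0.2),
    A.ElasticTransform(p=0.2),
])

page = np.full((896, 672, 3), 255, dtype=np.uint8)  # stand-in white page
augmented = augment(image=page)["image"]
```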

![[Pasted image 20230830003237.png]]
Figure 2: List of the different image augmentation methods used during training on an example snippet from a sample document.

During training time, we also add perturbations to the ground truth text by randomly replacing tokens. We found this to reduce the collapse into a repeating loop significantly. For more details, see Section 5.4.

4 Datasets

To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code, so we created our own from the open access articles on arXiv.⁴ For layout diversity we also include a subset of the PubMed Central⁵ (PMC) open access non-commercial dataset. During the pretraining, a portion of the Industry Documents Library⁶ (IDL) is included. See Table A.1 for the dataset composition.

Footnote 4: https://arxiv.org/

Footnote 5: https://www.ncbi.nlm.nih.gov/pmc/

Footnote 6: https://www.industrydocuments.ucsf.edu/

arXiv We collected the source code and compiled PDFs from 1,748,201 articles released on arXiv. To ensure consistent formatting, we first process the source files using LaTeXML⁷ and convert them into HTML5 files. This step was important as it standardized and removed ambiguity from the LaTeX source code, especially in mathematical expressions. The conversion process included replacing user-defined macros, standardizing whitespace, adding optional brackets, normalizing tables, and replacing references and citations with their correct numbers.

Footnote 7: http://dlmf.nist.gov/LaTeXML/

We then parse the HTML files and convert them into a lightweight markup language that supports various elements such as headings, bold and italic text, algorithms, LaTeX inline and display math and LaTeX tables. This way, we ensure that the source code is properly formatted and ready for further processing.

The process is visualized in Fig. 3.

![[Pasted image 20230830004235.png]]
Figure 3: Data processing. The source file is converted into HTML, which is then converted to Markdown. a) The LaTeX source provided by the authors. b) The HTML file computed from the LaTeX source using LaTeXML. c) The Markdown file parsed from the HTML file. d) The PDF file provided by the authors.

PMC We also processed articles from PMC, where XML files with semantic information are available in addition to the PDF file. We parse these files into the same markup language format as the arXiv articles. We chose to use far fewer articles from PMC because the XML files are not always as rich in semantic information. Oftentimes equations and tables are stored as images, and these cases are not trivial to detect, which led to our decision to limit the use of PMC articles to the pre-training phase.


IDL The IDL is a collection of documents produced by industries that have an impact on public health and is maintained by the University of California, San Francisco Library. Biten et al. [37] provide high quality OCR text for PDFs from the IDL dataset. This does not include text formatting and is only used for pre-training to teach the model basic OCR of scanned documents.

4.1 Splitting the pages

We split the markdown files according to the page breaks in the PDF file and rasterize each page as an image to create the final paired dataset. During the compilation, the LaTeX compiler determines the page breaks of the PDF file automatically. Since we are not recompiling the LaTeX sources for each paper, we must heuristically split the source file into parts which correspond to different pages. To achieve this, we use the embedded text on the PDF page and match it to the source text.

However, figures and tables in the PDF may not correspond to their position in the source code. To address this issue, we remove these elements in a pre-processing step using pdffigures2 [38]. The recognized captions are then compared to the captions in the XML file and matched based on their Levenshtein distance [39]. Once the source document has been split into individual pages, the removed figures and tables are reinserted at the end of each page.

For better matching, we also replaced Unicode characters in the PDF text with corresponding LaTeX commands using the pylatexenc library.⁸

Footnote 8: https://github.com/phfaist/pylatexenc
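For illustration, the latexencode module of pylatexenc performs exactly this kind of mapping; the input string is made up, and the exact output depends on the library version and settings.

```python
from pylatexenc.latexencode import unicode_to_latex

# Map Unicode symbols in an extracted PDF text line to LaTeX commands
# so the line can be matched against the LaTeX source.
pdf_line = "the rate λ ≥ 0 grows as λ² → ∞"
print(unicode_to_latex(pdf_line))
```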

Bag of Words matching First we extract the text lines from the PDF using MuPDF⁹ and preprocess them to remove page numbers and potential headers/footers. We then use a Bag of Words model [40] with a TF-IDF vectorizer and a linear Support Vector Machine classifier. The model is fitted to the PDF lines, with the page number as the label. Next we split the LaTeX source into paragraphs and predict the page number for each of them.

Footnote 9: https://mupdf.com/
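A compact sketch of this classifier with scikit-learn follows. The toy lines and page labels are invented; the real pipeline first strips page numbers and headers/footers, as described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: PDF text lines (e.g. extracted with MuPDF), labeled with
# the page number they appear on.
pdf_lines = ["1 Introduction", "We propose Nougat, a model that ...",
             "2 Related Work", "OCR is an extensively researched field ..."]
pdf_pages = [0, 0, 1, 1]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(pdf_lines, pdf_pages)

# Predict a (noisy) page index for each paragraph of the LaTeX source.
paragraphs = ["We propose Nougat ...", "OCR is an extensively researched ..."]
print(clf.predict(paragraphs))
```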

Ideally, the predictions will form a staircase function, but in practice the signal will be noisy. To find the best boundary points we employ logic similar to decision trees and minimize a measure based on the Gini impurity

$$G_{[a,b]}(i) = (b - a) \cdot \left( 1 - p_{[a,b]}^{2}(i) - p_{[a,b]}^{2}(i+1) \right),$$

where $p_{[a,b]}(i)$ is the probability of choosing an element with the predicted page number $i$ in the interval $[a, b]$ that describes which paragraphs (elements) were considered for the split.

The best splitting position $t$ in the interval $[a, b]$ is then

$$\hat{t}_i = \operatorname*{arg\,min}_{t} \left( G_{[a,t]}(i) + G_{[t,b]}(i) \right).$$

The search process starts with all paragraphs and for each subsequent page break, the lower bound of the search interval is set to the previous split position. See Fig. 4 for a visualization of an example page.
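The boundary search can be sketched in a few lines of NumPy. This is a reconstruction of the objective above, not the released implementation, and the example predictions are invented.

```python
import numpy as np

def gini(pred: np.ndarray, a: int, b: int, i: int) -> float:
    """G_[a,b](i) = (b - a) * (1 - p(i)^2 - p(i+1)^2) over pred[a:b]."""
    if b <= a:
        return 0.0
    seg = pred[a:b]
    p_i = float(np.mean(seg == i))
    p_next = float(np.mean(seg == i + 1))
    return (b - a) * (1.0 - p_i**2 - p_next**2)

def best_split(pred: np.ndarray, a: int, b: int, i: int) -> int:
    """Boundary between predicted pages i and i+1 inside [a, b)."""
    return min(range(a + 1, b), key=lambda t: gini(pred, a, t, i) + gini(pred, t, b, i))

pred = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1])  # noisy per-paragraph page predictions
print(best_split(pred, 0, len(pred), 0))       # -> 5: paragraphs [0, 5) go to page 0
```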

![[Pasted image 20230830010345.png]]
Figure 4: Example for splitting the paragraphs in the source code into different pages. The points in blue denote the page index predicted by the SVM.

Fuzzy matching After this first coarse document splitting, we try to find the exact position within the paragraph. This is done by comparing the source text within the neighborhood of the predicted splitting position to the last sentences of the previous page of the embedded PDF text, and the first sentences of the next page, using the fuzzysearch library.¹⁰ If the two dividing points are at the same location in the source text, the page break is considered "accurate" and receives a score of 1. On the other hand, if the splitting positions differ, the one with the smallest normalized Levenshtein distance is selected and given a score of 1 minus the distance. To be included in the dataset, a PDF page must have an average score of at least 0.9 for both page breaks. This results in an acceptance rate of about 47% of all pages.

Footnote 10: https://github.com/taleinat/fuzzysearch
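A small example of the fuzzy matching step with the fuzzysearch library; the strings are invented, and the score mirrors the "1 minus normalized distance" rule described above.

```python
from fuzzysearch import find_near_matches

# First sentence of the next PDF page, with a typo relative to the source.
page_start = "Transformer based models are prone to repetition"
source = "... end of page. Transformer based modls are prone to repetition. Next ..."

matches = find_near_matches(page_start, source, max_l_dist=5)
best = min(matches, key=lambda m: m.dist)
score = 1 - best.dist / len(page_start)  # 1.0 would be an exact match
print(best.start, best.end, round(score, 3))
```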

4.2 Ground truth artifacts

Because the dataset was pre-processed by LaTeXML, the markup version of the source code can contain artifacts and commands from unsupported packages. The HTML file may contain subsection titles with numbering even though they are not numbered in the PDF. There may also be instances where figures or tables are missing from the ground truth due to processing errors.

In addition, the splitting algorithm of the source code will in some cases include text from the previous page or cut off words from the end. This is especially true for "invisible" characters used for formatting, such as italic or bold text, or section headers.

For PMC papers the inline math is written as Unicode or italic text, while display math equations or tables are often included in image format and will therefore be ignored.

Each of these issues reduces the overall data quality. However, the large number of training samples compensates for these small errors.

5 Results & Evaluation

In this section we discuss the results and performance of the model. For an example see Fig. 5 or go to Sec. B. The model focuses only on the content-relevant features of the page; the box around the equations, for instance, is skipped.

![[Pasted image 20230830010822.png]]
Figure 5: Example of a page with many mathematical equations, taken from [41]. Left: image of a page in the document. Right: model output converted to LaTeX and rendered back into a PDF. Examples of scanned documents can be found in Appendix B.

5.1 Metrics

We report the following metrics on our test set.

Edit distance The edit distance, or Levenshtein distance [39], measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. In this work we consider the normalized edit distance, where we divide by the total number of characters.
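For reference, here is a minimal implementation of the normalized edit distance. The paper only says the distance is divided by the total number of characters; normalizing by the longer string, as done here, is one common convention.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    return levenshtein(pred, ref) / max(len(pred), len(ref), 1)

print(normalized_edit_distance(r"\frac{a}{b}", r"\frac{a}{c}"))  # small, but nonzero
```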

BLEU The BLEU [42] metric was originally introduced for measuring the quality of text that has been machine-translated from one language to another. The metric computes a score based on the number of matching n-grams between the candidate and reference sentence.

METEOR Another machine-translation metric with a focus on recall instead of precision, introduced in [43].

F-measure We also compute the F1-score and report the precision and recall.

5.2 Text modalities

In a scientific research article, there are three distinct types of text: 1) plain text, which comprises the majority of the document, 2) mathematical expressions, and 3) tables. It is important to separately examine each of these components during the evaluation process. This is necessary because in LaTeX, there are multiple ways to express the same mathematical expression. While some variability has been eliminated during the LaTeXML pre-processing step, there is still a significant amount of ambiguity present, like ordering of subscript and superscript, equivalent commands with different notation (`\stackrel`, `\atop`, `\substack` or `\frac`, `\over`), situationally interchangeable commands (`\bm`, `\mathbf`, `\boldsymbol`, `\bf` or `\left(`, `\big(`, etc.), whitespace commands, additional layers of brackets, and more. As a consequence, there can be a discrepancy between prediction and ground truth, even if the rendered formulas appear identical.

In addition, it is not always possible to determine where an inline math environment ends and text begins when writing numbers and punctuation (example: the sources `$\mathrm{H}_{0}$1,` and `H$_{0}1,$` render to the same visible text). This ambiguity reduces both math and plain text scores.

The expected score for mathematical expressions is lower than for plain text.

5.3 Comparison

We present our results in Table 1. As expected, the mathematical expressions have the worst agreement with the ground truth. For the plain text, most discrepancies come from formatting ambiguities and missing text due to inline math, as described above. The output format of GROBID is an XML file, which we convert into a compatible markup language, similar to the PMC or arXiv files. To some extent, GROBID provides support for formulas in its output, but it identifies and stores them as the Unicode representations embedded in the PDF. We replace each Unicode symbol with its corresponding LaTeX command to increase the similarity. Additionally, GROBID mislabels small inline expressions as text. For identified formulas, GROBID stores the bounding box coordinates. We modify the program by sending the snippet to the external formula recognition software LaTeX-OCR [20]. This way we can also get a signal for the math modality. The reported results in this section are quite poor, primarily due to the number of formulas missed by GROBID, and because the equation prediction accuracy is affected by the quality of the bounding boxes. The performance of the embedded PDF text alone is better than that of GROBID, which is due to formatting differences in the title page or reference section.

Both Nougat small and base are able to outperform the other approaches and achieve high scores in all metrics. We note that the performance of the smaller model is on par with the larger base model.

![[Pasted image 20230830011123.png]]
Table 1: Results on arXiv test set. PDF is the text embedded in the PDF file. The modality “All” refers to the output text without any splitting. *Number of parameters.

5.4 Repetitions during inference

We notice that the model degenerates into repeating the same sentence over and over again. The model cannot recover from this state by itself. In its simplest form, the last sentence or paragraph is repeated over and over again. We observed this behavior in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. Getting stuck in a repetitive loop is a known problem with Transformer-based models when sampled with greedy decoding [44].

It can also happen that the model alternates between two sentences but sometimes changes some words, so a strict repetition detection will not suffice. Even harder to detect are predictions where the model counts its own repetitions, which sometimes happens in the references section.

In general we notice this kind of behavior after a mistake by the model. The model is not able to recover from the collapse.

Anti-repetition augmentation Because of that we introduce a random perturbation during training. This helps the model to learn how to handle a wrongly predicted token. For each training example, there is a fixed probability that a random token will be replaced by any other randomly chosen token. This process continues until the newly sampled number is greater than a specified threshold (in this case, 10%). We did not observe a decrease in performance with this approach, but we did notice a significant reduction in repetitions. Particularly for out-of-domain documents, where we saw a 32% decline in failed page conversions.
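One straightforward reading of this perturbation procedure, sketched in Python; the helper name and token values are made up.

```python
import random

def perturb(tokens: list[int], vocab_size: int, p: float = 0.1) -> list[int]:
    """Keep replacing a random token with a random one while coin flips land below p."""
    tokens = list(tokens)
    while random.random() < p:
        pos = random.randrange(len(tokens))
        tokens[pos] = random.randrange(vocab_size)
    return tokens

print(perturb([5, 17, 42, 8, 99], vocab_size=50000))
```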

Repetition detection Since we are generating a maximum of $S = 4096$ tokens, the model will stop at some point; however, it is very inefficient and resource-intensive to wait for an "end of sentence" token when none will come. To detect repetition during inference time we look at the largest logit value $\ell_i = \max \boldsymbol{\ell}_i$ of the $i$th token. We found that the logits after a collapse can be separated using the following heuristic. First we calculate the variance of the logits for a sliding window of size $B = 15$:

$$\operatorname{VarWin}_B[\boldsymbol{\ell}](x) = \frac{1}{B} \sum_{i=x}^{x+B} \left( \ell_i - \frac{1}{B} \sum_{j=x}^{x+B} \ell_j \right)^2 .$$

Here $\boldsymbol{\ell}$ is the signal of logits and $x$ the index. Using this new signal, we compute variances again, but this time from the point $x$ to the end of the sequence:

$$\operatorname{VarEnd}_B[\boldsymbol{\ell}](x) = \frac{1}{S-x} \sum_{i=x}^{S} \left( \operatorname{VarWin}_B[\boldsymbol{\ell}](i) - \frac{1}{S-x} \sum_{j=x}^{S} \operatorname{VarWin}_B[\boldsymbol{\ell}](j) \right)^2 .$$

If this signal drops below a certain threshold (we choose 6.75) and stays below for the remainder of the sequence, we classify the sequence to have repetitions.

During inference time, it is obviously not possible to compute $\operatorname{VarEnd}_B[\boldsymbol{\ell}](x)$ to the end of the sequence if our goal is to stop generation at an earlier point in time. So here we work with a subset of the last 200 tokens and half the threshold. After the generation is finished, the procedure described above is repeated for the full sequence.
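A loose NumPy reconstruction of the two-stage variance heuristic. The stopping logic, and in particular the min_run guard against trivially short below-threshold tails, are assumptions layered on top of the formulas above.

```python
import numpy as np

def var_win(logits: np.ndarray, B: int = 15) -> np.ndarray:
    """Sliding-window variance of the per-token max logits."""
    return np.array([np.var(logits[x:x + B]) for x in range(len(logits) - B)])

def var_end(vw: np.ndarray) -> np.ndarray:
    """Variance of var_win from each position to the end of the sequence."""
    return np.array([np.var(vw[x:]) for x in range(len(vw))])

def has_repetition(max_logits: np.ndarray, threshold: float = 6.75,
                   min_run: int = 30) -> bool:
    ve = var_end(var_win(max_logits))
    below = ve < threshold
    if not below.any():
        return False
    start = int(np.argmax(below))  # first position below the threshold
    # flag only if the signal stays below until the end, for long enough
    return bool(below[start:].all()) and len(below) - start >= min_run

rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal(20, 5, 500), np.full(500, 25.0)])  # collapses at 500
print(has_repetition(trace))  # True: the flat tail is flagged
```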

![[Pasted image 20230830014658.png]]
Figure 6: Examples of repetition detection on logits. Top: sample with repetition, Bottom: sample without repetition. Left: highest logit score for each token in the sequence $\ell(x)$, Center: sliding-window variance of the logits $\operatorname{VarWin}_B[\boldsymbol{\ell}](x)$, Right: variance of the variance from the position to the end $\operatorname{VarEnd}_B[\boldsymbol{\ell}](x)$.

5.5 Limitations & Future work

Utility The utility of the model is limited by a number of factors. First, there is the problem with repetitions outlined in Section 5.4. The model is trained on research papers, which means it works particularly well on documents with a similar structure. However, it can still accurately convert other types of documents.

Nearly every dataset sample is in English. Initial tests on a small sample suggest that the model's performance with other Latin-based languages is satisfactory, although any special characters from these languages will be replaced with the closest equivalent from the Latin alphabet. Non-Latin script languages result in instant repetitions.

Generation Speed On a machine with an NVIDIA A10G graphics card with 24GB VRAM we can process 6 pages in parallel. The generation speed depends heavily on the amount of text on any given page. With an average of 1400 tokens per page we get a mean generation time of 19.5s per batch for the base model, without any inference optimization. Compared to classical approaches (GROBID: 10.6 PDF/s [4]) this is very slow, but it is not limited to digital-born PDFs and can correctly parse mathematical expressions.

Future work The model is trained on one page at a time without knowledge about other pages in the document. This results in inconsistencies across the document, most notably in the bibliography, where the model was trained on different styles, and in section titles, where numbers are sometimes skipped or hallucinated. Though handling each page separately significantly improves parallelization and scalability, it may diminish the quality of the merged document text.

The primary challenge to solve is the tendency for the model to collapse into a repeating loop, which is left for future work.

6 Conclusion

In this work, we present Nougat, an end-to-end trainable encoder-decoder transformer based model for converting document pages to markup. We apply recent advances in visual document understanding to a novel OCR task. Distinct from related approaches, our method does not rely on OCR or embedded text representations, instead relying solely on the rasterized document page. Moreover, we have illustrated an automatic and unsupervised dataset generation process that we used to successfully train the model for scientific document to markup conversion. Overall, our approach has shown great potential for not only extracting text from digital-born PDFs but also for converting scanned papers and textbooks. We hope this work can be a starting point for future research in related domains.

All the code for model evaluation, training and dataset generation can be accessed at https://github.com/facebookresearch/nougat.

7 Acknowledgments

Thanks to Ross Taylor, Marcin Kardas, Iliyan Zarov, Kevin Stone, Jian Xiang Kuan, Andrew Poulton and Hugo Touvron for their valuable discussions and feedback.

Thanks to Faisal Azhar for the support throughout the project.


Output

The output is a .mmd file. The .mmd extension belongs to MultiMarkdown, a lightweight markup language that extends Markdown with support for tables, footnotes, references, and more. In this context, .mmd files are primarily compatible with Mathpix Markdown and use LaTeX tables. The format is especially well suited to documents containing mathematical and scientific content.

Author: Maeiee

Link: Neural Optical Understanding for Academic Documents

Copyright notice: Unless otherwise stated, this is an original article. The copyright belongs to Maeiee; please do not reproduce it without permission!


If you enjoy my articles, feel free to leave a tip and encourage me to create more and better work!